MORSE THEORY IN TOPOLOGICAL DATA ANALYSIS 



HENRY ADAMS, ATANAS ATANASOV, AND GUNNAR CARLSSON 

Abstract. We introduce a method for analyzing high-dimensional data. Our approach 
is inspired by Morse theory and uses the nudged elastic band method from computational 
chemistry. As output, we produce an increasing sequence of cell complexes modeling the 
dense regions of the data. We test the method on several data sets and obtain small cell 
complexes revealing informative topological structure. 
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1. Introduction 

The analysis of large sets of high-dimensional data is a fundamental problem for all 
branches of science and engineering. Regression analysis can be used effectively when the 
data has a linear or low degree polynomial structure based on a choice of model for the data. 
However, it is often the case that the data is genuinely nonlinear and that there isn't an 
obvious choice for how to model it. The purpose of topological data analysis Carlsson||2009 



is to provide methods which produce simple combinatorial representations of the data. 

In this paper we construct a cell complex representation in which the cell structure depends 
on the density of the points in the data set. We adapt the mathematical formalism of Morse 
theory^ which in its idealized form constructs a cell decomposition of a manifold using sublevel 
sets of a function on the manifold (called the "Morse function" ), to the setting of point clouds, 
i.e. finite sets of points in Euclidean spaces. The specific features of our approach are as 
follows. 

• We use a density estimator as our analogue of the Morse function. One could also use 
other intrinsic functions on the geometry, such as notions of data depth, to obtain 
other compact representations. 



• In order to sample cells from the analogue of the Morse skeleton of the Morse complex, 
we adapt the nudged elastic band method (NEB) from computational chemistry 
[Jonsson et al. 1998; Henkelman and JonssonpOOOj to our situation. NEB has been 
used to study high-dimensional conformation spaces of complicated molecules, and 
has typically used an energy function as the analogue of the Morse function. 

• We produce an increasing sequence of cell complex models. In accordance with the 
idea of topological persistence, this output gives a more accurate representation of 
the data than the choice of any single complex. 

• For the moment, we construct only the one-dimensional skeleton of the cell complex. 
Producing higher-order cells will require more difficult mathematics, since the mini- 
mization problems involved in the construction of such cells are challenging, and are 
related to minimal surface problems in geometric analysis. 

In studying data sets computationally, one finds that outliers will generally obscure topo- 
logical features. One approach for dealing with outliers is threshholding by density, i.e. 
studying superlevel sets of a function that estimates density. This is the approach taken by 
Carlsson et al. [2008 and Adams and Carlsson 2009| , for example. One difficulty is that 



the superlevel sets of density frequently require large numbers of points to represent them, 
since they are "codimension zero" subsets. Consider Figure [T} which contains a standard 
Morse theoretic picture: a sublevel set of a torus in with Morse function given by the 
z-coordinate. As is standard in Morse theory, the sublevel set is homotopy equivalent to 
the Morse skeleton, which is the dotted loop in Figure [T] Note that in order to achieve 
an accurate representation of the homotopy type, sampling from the entire sublevel set will 
require many more points than sampling from the Morse skeleton. The goal of this paper 
is to demonstrate that one can effectively sample cells from the Morse skeleton of a density 
function on the data set, thereby obtaining a more economical representation of the topology 
of the data set. 



Figure 1. A sublevel set of a torus in with Morse function given by the 
z-coordinate. The dotted loop is the Morse skeleton. 

We demonstrate how the method works on several nonlinear test data sets, including sets 
arising in social networks, in image processing, and in microarray analysis. The results 
assist in our qualitative understanding of the data, and we discuss how this qualitative 
understanding may then be used. First, while it is common to cluster social network data, 
there is more understanding to be gained about the structure within a single cluster or about 
the relationships between different clusters. As illustrated with the social network data, our 
method can recover extra information. Second, the qualitative features we identify in image 
processing data have applications in compression and texture analysis. Third, microarray 
analysis has been used to identify genes which are part of the cell cycle Spellman et al.||1998 



Alter et al. 2000; Whitfield et al. 2002 



The expressions for these genes are recurrent but 
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not periodic: the amplitude and period of the expressions vary with time. One would like 
methods for detecting recurrent data which are robust to such changes. More generally, one 
may want to be able to recognize recurrent data in which the sampling times are irregular, 
spaced further apart than the period, or unknown. Our method offers tools in this direction. 

We describe related work in Section [2| we provide background material on cell complexes, 
Morse theory, and NEB in Section [3| we describe our method in Section [4| we illustrate the 
results on test data sets in Section [5| and we conclude in Section |6| 

2. Related w^ork 



Morse theory has previously been adapted to discrete and applied settings. Forman 



studies discrete Morse functions that assign a single value to each cell in a complex. 



2002 



Edels- 



brunner et al. [2003] stud y piecewise linear Morse functions defined on triangulated mani- 
folds, and|Gyulassy et al. 2007J use a similar framework to analyze scalar functions arising 



in data analysis. By contrast, we will work with point clouds instead of triangulated spaces. 
Given a point cloud data set drawn from an unknown probability density function, one may 



cluster the data by approximating the basins of attraction of the modes of density Wishart 



1969; Hartigan 1975, pp. 205]. The goal of Stuetzle and Nugent! |2010| is to recover the full 
cluster tree of the density function, which describes how the connected components of the 
superlevel sets merge as the density threshold value decreases. Regardless of its topology, a 
connected component of a superlevel set is always represented in the cluster tree by a single 
point. There is more information to be gained by considering the topology of the superlevel 
sets. For example, some superlevel sets of the density function in Figure [2] have circular 
topology, but this is not apparent in the cluster tree. 





Figure 2. A probability density function on the left and the cluster tree for 
its superlevel sets on the right. Qualitatively, the function has a circular base 
and three bumps of varying heights. For a certain choice of density threshold 
the superlevelset of the function has the topology of a circle, but this is not 
refiected in the cluster tree. 



In topological data analysis one often models a superlevel set of densit y with a simplicial 



complex, such as the Cech, Vietoris-Rips, Alpha, or witness complexes 



Miicke 1994; de Silva and Carlsson 2004 



Edelsbrunner and 



descriptors of a superlevel set: 



These simplicial complexes alone are not useful 
they typically contain thousands of simplices and hence 
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are too large to interpret by hand. However, each of the Cech, Vietoris-Rips, Alpha, and 
witness complexes has a connectivity parameter that one can vary to produce an increasing 
sequence of simplicial complexes. One can compute the persistent homology of this increasing 
sequence to approximate the homology groups of the superlevel set [Edelsbrunner et al.|2002 



Zomorodian and Carlsson 2005; Edelsbrunner and Harer 2010 



Though homology is a useful signature for describing the geometry of a superlevel set, 
in general there are several possible models with given homology groups, and determining 
which model best fits the data is a supervised step. One may use our Morse theoretic 
method to identify models which match not only the homology groups but also the geometry 
of the data, and so our method plays a complementary role to persistent homology. For 
example, Carlsson et al. [2008 ] use persistent homology to identify an image processing data 
set with rank(i7o) = 1 ^iid rank(i7i) = 5, where Hi denotes the i-th homology group. There 
are many spaces satisfying these homological constraints, including a wedge sum of five 
circles. Carlsson et al. propose a particular space: a model containing three circles with four 
intersection points (Figure fTl). In Section 5.2 we use our method to obtain this three circle 



model directly. Our model contains only 12 cells, much fewer than the number of simplices 
needed in a witness complex reconstruction. 

Whereas Carlsson et al. [2008 and [Adams and Carlsson [2009 study only one superlevel 
set of density at a time, the approach of Chazal et al. [2011 is similar to ours in that they 
study multiple superlevel sets with varying threshold values simultaneously. 



3. Background 

We briefiy introduce three topics: CW complexes, Morse theory, and the nudged elastic 
band method. 

3.1. CW complexes. A CW complex is a type of cell complex. For A; a nonnegative integer, 
a /c-cell is the closed ball G | \\y\\ < 1} of dimension k. So a 0-cell is a point, a 1-cell is 
a line segment, a 2-cell is a disk, etc. A CW complex is a topological space formed by the 
following inductive procedure. The 0-skeleton Vl^(^) of Vl^ is a set of 0-cells. The 1-skeleton 
Vl^^^^ is formed by gluing the endpoints of 1-cells to the 0-skeleton, and can be thought of 
as a graph. Inductively, we form the /c-skeleton Vl^(^) by gluing the boundaries of /c-cells to 
the (k — l)-skeleton W^^~^\ If is a finite-dimensional CW complex, then this process 
terminates and we have W = W^^"^ for some k. See Figure [3] for an example and Hatcher 
[200^ 



for further details. 



3.2. Morse theory. The following introduction to Morse theory is informal; see [Milnor 
1965 for a thorough treatment. Suppose M is a compact manifold of dimension d and 
/: M ^ M is a smooth function with non-degenerate critical points mi, ... , ruk G M that 
satisfy 

cto < f{mi) < ai < /(ms) < • • • < a^-i < /(m^) < a^. 

The index of a critical point rrti is the number of linearly independent directions around rrii 
in which / decreases. So a minimum has index 0, a maximum has index rf, and a saddle point 
has index between 1 and d—1. Let Ma = /~^((— oc, a]) be the sublevel set corresponding to 
a G M. Morse theory tells us that each Ma- is homotopy equivalent to a CW complex with 
one A^cell for each critical point m^. In particular. Ma- is homotopy equivalent to Ma-_^ 
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Figure 3. A stick figure represented as a CW complex containing seven 0- 
cells, seven 1-cells, and one 2-cell. The 0-skeleton is on the left, the 1-skeleton 
is in the center, and the full 2-skeleton is on the right. 

with a single A^-cell attached. For instance, M^^ is homotopy equivalent to a point and is 
obtained from M^q, the emptyset, by attaching a single 0-cell. Figure [4] contains an example. 




Figure 4. Morse theory example. {Top) Manifold M is a torus and function 
/: M ^ M is the height function. There are four critical points mi, . . . , 
with indices 0, 1, 1, and 2 respectively. Let Qq < /(mi) < ai • • • < f{m^) < a^. 
{Bottom) Moving from left to right, we attach cells of dimension 0, 1, 1, and 
2 in order to obtain Ma- from Ma-_^. 



Though Morse theory is traditionally stated in terms of sublevel sets, there is an equivalent 
formulation in terms of superlevel sets. The superlevel sets of / correspond to the sublevel 
sets of — /, so m^ is a critical point of / with index if and only if m^ is a critical point 
of — / with index d — A^. It follows that superlevel set M^'-^ = /~-^([a^_i, oc)) is obtained 
from M^' by attaching a cell of dimension d — Xi. 

3.3. Nudged elastic band. To find saddle points we use the nudged elastic band method 



(NEB) from computational chemistry Jonsson et al. 1998; Henkelman and Jonsson 2000 



Consider a chemical system, e.g. several molecules, whose space of states is parametrized by 
and is equipped with a diflFerentiable map : ^ M encoding the potential energy of 
the system at each state. The local minima of E correspond to stable states, and chemists are 
interested in finding reaction paths between two stable states. A reaction path is a minimum 
energy path, whose points minimize E in all directions perpendicular to the path, and which 
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passes through at least one saddle point of index one. In NEB, a path is approximated as 
a piece wise linear path, which we call a band. Forces move the band towards a minimum 
energy path. A gradient force moves each node in the band along the component of —VE 
perpendicular to the band, and a spring force prevents the band from tearing apart. Extra 
forces can be added to smoothen or dampen the motion. 



4. Method 

We suppose data set X C is a finite sampling from an unknown probability density 
function g:W^^ [0,oc). We are interested in regions of high density, for if g{x) is very 
small for some x G X then x is regarded as a noisy datapoint not representing the main 
features. Therefore, we would like to understand the superlevel sets 

Y'^ = g-'{[a,oo)) = {y eR^ \ g{y) > a} 

containing all points in with density at least a G [0, oc). The superlevel sets encode 
how the dense regions of data set X are organized. 

We follow the ideas of Morse theory to build CW complex models approximating the 
superlevel sets y^. First we use X to build a diflFerentiable density estimate /: ^ [0, oc) 
approximating g. The 0-cells of our models will be local maxima of /. The 1-cells will be 
paths of high density between 0-cells, found using NEB. We suggest an adaptation of NEB to 
find dense 2-cells with boundaries in the 1-skeleton, and analogously for higher dimensions. 
Given a cell e C M^, let its density be inf{/(^) \ y G e}. We define to be the union of the 
cells with density greater than or equal to a. 

Note that Z^ C Z^ for a > b. A feature of our approach is that we produce the CW 
complex model Z^ for many values of a at once. This allows the user to observe how Z^ 



grows as a decreases. In accordance with the idea of topological persistence Edelsbrunner 


et al. 


2002; 


Zomorodian and Carlsson 


2005; 


Edelsbrunner and Harer 


2010], knowing 



over a range of values for a gives a better picture of the data set than knowing Z^ for any 



single choice of a. For example, in Section 5.1 we produce three different models — four 



vertices, a square loop, and a square disk — for a social network data set (Figure |6]). Each 
model represents the data at a different density threshold, and together they provide a more 
complete understanding than any single CW complex model. In addition, our nested output 
can be used to measure the significance of topological features. Suppose a 1-cell with density 
a appears in Z^ to form a new loop, and suppose a 2-cell with density b < a appears in Z^ 
to fill this loop to a disk. Then a — b measures how long this loop persists. If a — 6 is large 
then this loop is likely a significant feature of the data set, but if a — 6 is small then the loop 
may be the product of noise. 

Our method depends on the choice of several parameters. The main parameter which we 
find necessary to tune is the standard deviation used to build a density estimator. We use 
a consistent choice for all other parameters across all of our test data sets, which suggests 
that the other parameters are not as difficult to select. 

Several steps in our approach can be treated in a variety of ways. To name a few, we 
estimate density, find local maxima, generate random initial bands, simulate bands according 
to a formulation of NEB, and cluster. We use simple methods to handle each of these steps, 
but we leave open the possibility of substituting more sophisticated methods. 
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4.1. Density estimation. Using X, we build the differentiable density estimator /: ^ 
[0, oo) approximating g as follows. Let ipx^a - [0, oc) be the probability density of a 
normal distribution centered at x G with standard deviation a > 0. More explicitly the 
n X n covariance matrix contains along t he diagona l and elsewhere. We use the kernel 
estimator f{y) = \X\~^ J^xex ^x,a{y)- See Silverman 1986 for other possible estimators, 
including different kernels or adaptive choices of standard deviation. 

We regard the selection of a as the most important parameter choice. One may increase 
a to smoothen / or decrease a to expose local detail. 

4.2. 0-cells. To find 0-cells, we pick a random sample of points from X and fiow each along 
its gradient towards a local maxima of / using the mean shift iterative procedure [Fukunaga^ 
and Hostetler 1975; Cheng 1995|. Let ^ be one of the initial points. We define a 



sequence ?/o,?/i, 
defined by 



by setting i/i^i = m{yi)^ where the mean shift function m\ 



IS 



The vector m{y) — y is proportional to the normalized gradient V f{y)/ f{y)^ and the yi 
converge to a local maxima of /. 

It is necessary to cluster the convergent points to identify which represent the same 0-cell, 
and we select the densest member from each cluster as a 0-cell. We make the following 
consistent parameter choices with all test data sets. We say point yi has converged when 
||m(?/^)— y^ll is less than 10~^, and we perform single-linkage clustering with threshold distance 
0.3. 

4.3. 1-cells. To find 1-cells we use NEB, which we now describe in our data analysis setting. 
Our formulation is similar to [Jonsson et al. [1998 and Henkelman and Jonsson |2000l . A 



piecewise linear band is given by a list of nodes ^i, ^2, • • • , ^at, with endpoints Vi and 
in our set of 0-cells. Forces act on the intermediate nodes Vi with 1 < i < N while the 
endpoints remain fixed. The first task is to approximate a unit tangent vector at each 
intermediate node. Define = Vi^i — Vi and = Vj — Vj-i. We use a nai've tan gent 
estimate = {u^ + u7)/\\ut 



a more elaborate tangent. 



+ II given by averaging. Henkelman and Jonsson 2000 



use 



At each intermediate node Vi we define a total force 



(1) Fi = cVf{vi)\± + (ll^i+ll - ll^i- ||)r^ + smoothing. 

The expression Vf{vi)\± is the component of V f{vi) perpendicular to the tangent r^, and is 
called the gradient force. The gradient constant c adjusts the strength of the gradient force. 
To normalize with respect to the maximum gradient of the normal distribution, we set 

c=(sup ||V^o>(l/)ll)"' = (^^)' 



e. 



The term (||t^^|| — ||t^^~||)r^ is the spring force. This name is slightly misleading, as the spring 
force neither enforces a natural spring length on each edge nor minimizes the length of each 
edge. Instead, the spring force aims to equate the lengths of adjacent edges in the band. 

In Eq. [1], smoothing stands for a smoothing force added to prevent kinks, which hinder 
convergence, from forming in the band. Let 9i be the angle between vectors u'l and u'^ . 
Typically these vectors are close to parallel and 9i is close to zero. Let < a < /3 < tt 
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be fixed angles, and let ha^/s' [0,7r] [0, 1] be zero for x < one for x > /5, and increase 
continuously from zero to one as x increases from a to 13. For instance, we set 

{0 if X < a 

(1 - cos(^|5^))/2 if a < X < /5 
1 ifx>/3. 

We define our smoothing force to be ha^(3{0i){u'l — u^)^ which moves Vi in order to decrease 
9i whenever 9i> a. 

Evolving the band amounts to numerically solving the system of first order diflFerential 



equations v[ = Fi. Figure 16 in Appendix [B| shows the motion of a sample band. Note 
we use the first derivative v[ rather than the second derivative. In the chemistry setting, it 
is appropriate to set acceleration proportional to the gradient of the potential energy. In 
our setting, we do not view V/ as a force in the literal sense but only as an indication of 
which direction to move in order to maximize /. As the first order equation is simple and 
gives good results, we use it. Nevertheless, we have also had success with the second order 
equation and do not dismiss its use. 

Let p and q be distinct 0-cells. To find the 1-cells between p and q we generate a sample 
of initial bands joining them, as described in Appendix |B] We evolve each band until it 
converges and discard non-convergent bands. We also discard convergent bands which pass 
too close to any 0-cell r 7^ g, as such a band should instead appear as the concatenation 
of two others. 

To identify which bands represent the same 1-cell between p and q we cluster the bands. 
We define a metric d^^q^N on bands of nodes starting at p and ending at g. If '^1, . . . , 'Ua^ and 
Ci, . . . , t^AT are two such bands, then let rfp,g,Ar({^i}, {vi}) = {N — 2)~^ '^iJ2 d{vi^ Vi). From 
each cluster we select the band with the highest density to represent the 1-cell. We estimate a 
band's density as min{/('Ui), . . . , /{vn)}^ though one could obtain a more accurate estimate 



by subdividing the band further or by using the climbing image method of [Henkelman et al. 
[2000]. 

We make the following consistent parameter choices for all of the test data sets. We 
include A = 11 nodes in the bands and let the smoothing force angle parameters he a = 7t/6 
and /3 = 7r/2. We set gradient constant c as above and say a band has converged when 
(A — 2)"-^ X]il:2^ \ Wi\\ is less than 10~^. If a convergent band is within distance 0.5 of another 
0-cell, we discard the band. We perform single-linkage clustering with threshold distance 
0.3. 

4.4. Higher-dimensional cells. In order to search for higher-dimensional cells, one can 
imagine adapting NEB. We discuss one approach, which we use with the test data sets, in 
Appendix [C| We also discuss several weaknesses of this approach. 

5. Results 

We test our method on five data sets whose sizes and dimensions are shown in Table [H 
Please see Appendix [A] for further information about the data sets. We have implemented 
our method with Java code, which is available along with four of the data sets at the following 



webpage: http : //code . google . com/p/neb-tda 



Table 1 . Data set information 





social network 


optical 


range 


optical flow 


gene 


size of data set X 


1,127 


15,000 


15,000 


15,000 


43 


dimension n 


5 


8 


24 


16 


30 


standard deviation'^ a 


0.45 


0.20 


0.35 


0.30 


1.35 



^This is not a property of the data set, but instead the standard deviation 
we use to estimate density. 



5.1. Social network. The National Longitudinal Study of Adolescent Health is a school- 
based study of American youth. In 1994-95 a sample of high schools and middle schools 
was chosen, and when possible each high school was paired with a sister middle school from 
the same community. The students from each school were asked to list up to five of their 
closest female friends who attend their school or their sister school, and likewise for their 
male friends. 

Moody [2001 creates network graphs from the survey data. Each student is represented 



by a vertex, and the edge between two students exists if each student listed the other as 
a close friend. We analyze the graph for the students from "Countryside High School," a 
pseudonym used by Moody, and its sister middle school. Of the 1,147 students from this 
community who participated in the survey, 20 shared no connections with any others and 
were removed from the graph. 

Figure [5] illustrates the following interesting structure in this graph. Most of the students 
can be placed into one of four groups, containing a majority of white high schoolers, non- 
white high schoolers, white middle schoolers, or nonwhite middle schoolers. There are many 
friendships between students in the same group. In addition, there are a significant number 
of friendships between groups of the same school or race category. However, their are very 
few friendships between groups with neither category in common. 





Figure 5. Social network for "Countryside High School" and its sister middle 
school. {Lefi) Cross vertices are students from the middle school and circle 
vertices are students from the high school. {Right) Cross vertices are white 
students and circle vertices are nonwhite students. A handful of students are 
without race data and their vertices are left unmarked. 



Our goal is to use our Morse theory approach to recover this structure. We use stress 
majorization, an optimization strategy in multidimensional scaling Cox and Cox 2001 , to 
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embed the vertices of the graph as a data set X C in a manner distorting the shortest 
path metric as httle as possible. The shortest path distance between vertices v and w is the 
fewest number of edges one must cross to travel from v to w. Each point x G X corresponds 
to a student, and we regard the resulting set X as our input data set. 

Our method finds four 0-cells, which correspond to the four groups of students. We find 
four 1-cells which form a square loop. The 0-cells have densities in [1.69 • 10~^, 2.02 • 10~^] 
and the 1-cells in [1.07 • 10"^, 1.54 • 10"^] Q We find a 2-cell filling the loop with density 
0.79-10-2. 

Recall CW complex is defined to be the union of the cells with density greater than 
or equal to a. For a G (1.54 • 10"^, 1.69 • lO^^) the model consists of four 0-cells 
(Figure |6]). Hence we recover the groupings of students based on school and race. For 
a G (0.79 • 10"^, 1.07 • 10"^) the model Z^ is a square (Figure |6]). The square reveals 
that groups sharing school or race category are more closely linked than groups sharing 
neither. This suggests that it is more difficult to cross two cultural barriers than one. For 
a < 0.79 • 10~2 the model Z^ fills to a disk (Figure [6]), which is an appropriate representation 
of the data at a sufficiently coarse scale. Note that this increasing sequence of CW complex 
models provides a better understanding of the data set than any single model alone. 




Figure 6. Social network data, projected to a plane using principal com- 
ponent analysis (PCA). Complex Z^ grows as density threshold a decreases. 
From left to right, we have four 0-cells, a square loop, and a disk. 



van der Schaaf 



Leeetal. 2003 



5.2. Optical image patches. The optical image database collected by [van Hateren and 
[l998| contains a variety of indoor and outdoor scenes. From this database, 
select a large random sample of 3 x 3 patches. Each patch is thought of as a 
point in with coordinates the logarithms of grayscale pixel values. Lee et al. define a norm 
measuring the contrast of a patch and select the high-contrast patches. They normalize each 
patch by subtracting the average patch value and dividing by the contrast norm. Lastly, Lee 
et al. change to the Discrete Cosine Transform (DCT) basis {ei, 62, ... , eg}, which maps the 
patches to the unit sphere 5''' C M^. Note ei and 62 are horizontal and vertical linear gradients 



and 63 and 64 are horizontal and vertical quadratic gradients (Figure |T2| in Appendex A). 



Let A4 be the resulting set of high-contrast, normalized, 3x3 patches. Though Ai fills out 
all of ^ some regions of the sphere are more dense than others. 

use persistent homology to study Ai. 



Carlsson et al. 2008 



Using a family of density 
estimators they select a family of dense core subsets from Ai. They apply persistent homol- 
ogy, and for a global estimate of density their core subset has rank(i/o) = rank(i7i) = 1, 



-'^Occasionally we also find a 0-cell with density below 6-10 
affect our analysis. 
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but due to its low density value it does not 



where Hi denotes the i-th homology group. This core subset hes near the primary circle 
{aei + (3e2 \ (o^,^) G S^} containing linear gradients at all angles (Figure [t]). With a more 
local estimate of density, the core subset has rank(i7o) = 1 and rank(i7i) = 5. Carlsson et al. 
identify a three circle model matching this homology profile and the data. In addition to the 
primary circle, the three circle model contains two secondary circles, {aei+I^C'^ \ (a, (5) ^ S^} 
and {ae2 + /5e4 | (a, ^) G S^}^ which include quadratic gradients in the horizontal or vertical 
direction (Figure [7|. The primary circle reflects nature's preference for linear gradients in all 
directions and the secondary circles reflect nature's preference for the horizontal and vertical 
directions. 




Figure 7. Three circle model. The solid outer circle is the primary circle 
and contains linear gradients. The dotted and dashed inner circles are the 
horizontal and vertical secondary circles which contain quadratic gradients. 
Each secondary circle intersects the primary circle twice, but the secondary 
circles do not intersect each other. 



For this paper we select a random input data set X C of size 15,000. We flnd four 0-cells 
located near the four most common patches ±ei and ±62- Between each of the four 0-cell 
pairs {ei, 62}, {62, — ei}, {— ei, —62}, and {—62, ^1} we flnd a quarter-circular 1-cell. Together 
these form the primary circle. Between the pair {ei, — ei} we flnd two semicircular 1-cells 
forming the horizontal secondary circle, and between {62, —^2} we flnd two semicircular 1- 
cells forming the vertical secondary circle. The 0-cells have densities in [2.11, 2.36] the 
primary circle 1-cells in [1.17, 1.28], and the secondary circle 1-cells in [0.33,0.38]. For a G 
(1.28,2.11) our model is the four most common patches, for a G (0.38,1.17) it is the 
primary circle, and for a < 0.33 it is the three circle model (Figure [s]). 

[Carlsson et al. 2008 discover a 2-dimensional Klein bottle surface that contains the three 
circle model as its backbone and that flts a core subset of Ai. The Klein bottle model is a 
good example of how data set models can be used not only to improve understanding but 



Occasionally we also find a 0-cell with density below 0.08, but due to its low density value it does not 
affect our analysis. 
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Figure 8. (Left) Optical image patches X and primary circle complex for 
a G (0.38, 1.17), projected to the 6162 plane. (Center) Three circle model 
for a < 0.33, projected to the plane spanned by 61 + ^64 and 62 + 1^3 so that the 
secondary circles are visible. (Right) The densest 8,500 points of X, projected 
to the 6162 plane. The primary circle appears clearly and the projections of 
the secondary circle patches form a faint cross. 



also to improve applications. As a low-dimensional manifold, the Klein bottle can be used 
in image compression schemes [Carlsson et al. 2008 . In addition, the Klein bottle model is 



being used to identify and analyze optical image textures Perea and Carlsson 



5.3. Range image patches. In a range image each pixel stores the distance between the 
laser scanner and the nearest object in the corresponding direction. Lee et al. 12003 select 



a large random sample of log- valued, high-contrast, normalized, 3x3 range image patches 
from the Brown database, which contains a variety of indoor and outdoor scenes. They 
observe that the patches cluster near binary patches, where the binary values correspond to 
foreground and background. 

Though the largest clusters are arranged in the shape of a circle, the 3x3 binary patches 
are too coarse to fill the full circle. Adams and Carlsson 2009 consider 5x5 patches, 
preprocessed in manner similar to Lee et al. 2003|. They obtain a large sample TZ of high- 
contrast, normalized, 5x5 range image patches. The 5x5 DCT basis now contains 24 
vectors {ei, 62, • • • , ^24}, where Ci and 65 are horizontal and vertical linear gradients (Figure 
13 in Appendix [a]). With persistent homology they find that a dense core subset of TZ is well 
modeled by the range primary circle {aei + (3e^ \ (a, (5) G S^}. 

For our Morse theory approach, we pick a random subset X C 7^ of size 15,000. We find 
four 0-cells near ±ei and ±65. Three of these 0-cells have densities in [0.59,0.75] while the 
0-cell near ei has density 1.47. This refiects the fact that many range patches are shots of 
the ground and are hence near the horizontal linear gradient given by ei. We find four 1-cells 
forming a loop with densities in [0.51,0.64], and a 2-cell filling the loop with density 0.39. 
For a G (0.75, 1.47) our model Z^ is the single 0-cell near ei, and for a G (0.39,0.51) our 
model Z^ is the range primary circle (Figure [9]). 

Any method in data analysis must tolerate some amount of noise, that is points x G X 
with low density. One may try to remove noisy points from the data set, but in general 
it is not easy to remove noise without also removing features. Because we use data set X 
primarily to build density estimate / : ^ M, we believe that noise which is sparse enough 
to not significantly affect / will not significantly affect our CW complex output. As evidence 
for this claim, consider the large samples Ai^TZoi optical and range image patches. [Carlsson 
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Figure 9. (Left) Range image patches X and range primary circle for 
a G (0.39,0.51), projected to the 6165 plane. (Right) The densest 9,000 points 
of X. 



et al. 2008 and Adams and Carlsson 2009 remove noise from Ai and TZ by studying only 



core subsets of points with densities in the top 30%. However, in this paper we work with 
random subsets of Ai and TZ containing both dense and noisy points. 

5.4. Optical flow patches. A video records a sequence of images, and the apparent motion 
in the images is called optical flow. At each frame, the optical flow is represented by a vector 
fleld with one vector per pixel that points to where that pixel appears to move for the 
subsequent frame. No instrument measures optical flow directly, and estimating optical flow 
from a video is an ill-posed problem. However, Roth and Black |2007| create a database 
of ground-truth optical flow by pairing range images with camera motions and calculating 
the produced optical flow. The range images are from the Brown database, and the camera 
motions are retrieved from a database of videos from hand-held or car-mounted cameras. 

With preprocessing steps analogous to Lee et al. (2003 1 we create a large sample 
of high-contrast, normalized, 3x3 optical flow patches. We change to the DCT basis 
{e^, . . . , eg, e^, . . . , eg}, where the superscript u denotes flow in the horizontal direction and 
V denotes the vertical direction (Figure 14 in Appendix [A]) . We select X to be a random 
subset of of size 15,000. 

We flnd four 0-cells near ±e^ and and four 1-cells which form a loop. These cells have 
densities in [0.71, 2.79]. We find a 2-ceU filling the loop with density 0.39. For a G (0.39, 0.71) 
our model recovers the horizontal fiow circle near {ae^ + /3e2 \ (o^, P) G S^} (Figure 10). 
Note that the horizontal fiow circle is obtained by applying horizontal camera motion to 
range patches from the range primary circle. Also, one expects horizontal camera motion to 
be more common than vertical motion in hand-held and car-mounted videos. Therefore, the 
horizontal fiow circle combines important patterns from both the range image and camera 
motion databases. 



5.5. Gene expression data. Spellman et al. 1998 , Alter et al. 2000 , and Whitfield et al. 



[2002 1 identify genes associated with the cell cycle using microarray analysis, and the ex- 
pression levels of these genes are cyclical. One way to model cyclic behavior mathematically 
is with a periodic function. However, while a periodic function has a fixed amplitude and 
period, cyclic behavior in biology is not as rigid. The amplitude and period of cell cycle 
gene expressions change with time. In particular the amplitude tends to decrease, which 
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Figure 10. (Left) Optical flow patches X and horizontal flow circle for 
a G (0.39,0.71), projected to the 6^63 plane. (Right) The densest 6,000 points 
of X. 



can be interpreted as the dephasing of initially synchronized cells. Another way to model 
cyclic behavior mathematically is with a circle, and this topological representation is robust 
to changes in amplitude and frequency. 

Whitfield etal] [200 2 1 measure gene expression levels as HeLa cells proceed through the cell 



cycle. We select genes whose measurements pass several quality thresholds and which deviate 
significantly from the average expression. Our thresholding is stringent as the purpose of 
this exercise is not to discover new biology but to test our method on time series data. 
We normalize and cluster the resulting genes. One cluster consists of six genes which are 
well-known to be part of the cell cycle and which have periodic expression levels, as shown 



in Figure 15 in Appendix A For time point t G {0, 1, . . . , 46}, let Wt be the vertical vector 
of length six containing the expression levels of these genes at time t. We use blocks of five 
time points to form our data set. We set X = {xq, . . . , X42} C M^°, where 
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We apply our method to X and find three 0-cells and three 1-cells which form a circle. 
These cells have densities in [9.4 • 10~^^, 12.2 • 10~^^]. We find a 2-cell filling the circle with 
density 4.0 • lO"-*^^. The model for a G (4.0 • 10"-^^, 9.4 • 10"-^^) is a circle which recovers 



the cyclical nature of the gene expressions (Figure 11). 



6. Conclusion 

We have introduced a method for finding structure in high-dimensional Euclidean data 
sets. Following Morse theory, we use the nudged elastic band method to sample cells from 
the analogue of the Morse complex determined by the density function. We produce a 
nested family of CW complex models representing the dense regions of the data. We test the 
approach on a variety of data sets and find compact complexes revealing important nonlinear 
patterns, suggesting that the method may be a helpful tool for understanding new data sets. 
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Figure 11. Gene expression data set X, projected to a plane using PCA. 
Complex Z"" is a circle for a e (4.0 ■ 10-^^ 9.4 ■ 10"^^). 



Appendices 



Appendix A. Additional data set information 

We have implemented our method with Java code which is available at the following 
webpage: http://code.google.eom/p/neb-tda. The optical image patches, range image 
patches, optical flow patches, and gene expression data sets are also available at this webpage. 
The raw data for the social network data set may be obtained as explained in the following 
section. 

We provide further information about each of the test data sets. 

A.l. Social network. More information is available at the Adolescent Health Study web- 
page: |http : //www . cpc . unc . edu/pro j ect s/addhealth[ To obtain a copy of the non-linkable 
Adolescent Health Network Structure files, please contact addhealthQunc . edU| 



A. 2. Optical image patches. The optical image database collected by [van Hateren and 



van der Schaaf |1998| contains 4,167 grayscale images from indoor and outdoor scenes, each 
1020 X 1532 pixels (Figure [T2I). More details and the database itself are available at http: 



//www. kyb.mpg.de/bethge/vanhateren/index.php. From this database, Lee et al.^ [2003J 



create a set of high-contrast, normalized, 3x3 patches through the following preprocessing 
steps. 

(1) Lee et al. select a large random sample of 3x3 patches with coordinates the logarithms 
of grayscale pixel values. 





X4 


X7 








X3 


X6 





Each patch is represented by a vector x = (xi, . . . , Xg)^ G M^. 
(2) Lee et al. define a norm || measuring the contrast of a patch. Two coordinates 
Xi and Xj of x are neighbors, denoted i ^ j, if the corresponding pixels in the 3x3 
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Figure 12. {Top) Sample optical images from the van Hateren database. 
{Bottom) Vectors ei, 62, . . . , eg from the 3x3 DCT basis. 



patch are adjacent. Let \\x\\d = y'^Zir^ji^i ~ ^jY- Lee et al. select the patches with 

contrast norm in the top 20% of their sample. 

(3) Lee et al. normalize each patch by subtracting the average patch value and by dividing 
by the contrast norm. This maps the patches to a 7-dimensional ellipse. 

(4) Lee et al. change to the Discrete Cosine Transform (DCT) basis {ei, . . . , 63} for 3 x 3 
patches. The basis vectors are normalized to have contrast norm one, and so this 
maps the patches to a 7-dimensional sphere. 

Let Ai be the resulting set of high-contrast, normalized, 3x3 optical patches. [Carlsson 
et al. 2008 study dense core subsets from M. using persistent homology. In this paper we 



select a random test data set X C of size 15,000. 



A. 3. Range image patches. The Brown range image database by Lee and Huang is a set 



of 197 range images from indoor and outdoor scenes, mostly 444 x 1440 pixels (Figure 13). 
The operational range for the Brown scanner is typically 2-200 meters, and the distance 
values for the pixels are stored in units of 0.008 meters. The database can be found at the 
following webpage: http : //www . dam . brown . edu/ptg/brid/index . html . 

From the Brown database we obtain a space of range image patches through the following 



steps, which are similar to the procedures of Lee et al. 2003J and Adams and Carlsson 2009 

(1) We randomly select about 4 • 10^ size 5x5 patches from the images in the database. 
Each patch is represented by a vector x G M^^ with logarithm values. 

(2) We compute the contrast norm \\x\\d = 'sjYlir^ji^i ~ ^jY Q^ch patch and select 

the patches with contrast norm in the top 20% of the entire sample. 

(3) We subtract from each patch the average of its coordinates and divide by the contrast 
norm. 
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Figure 13. (Top) Sample range images from the Brown database. (Bottom) 
Vectors ei and 65 from the 5x5 DCT basis are horizontal and vertical linear 
gradients. 



(4) We change to the DCT basis {ei,...,e24} for 5 x 5 patches, normalized to have 
contrast norm one. This maps the patches to a 23-dimensional sphere. 

Let TZ be the resulting set of high-contrast, normalized, 5x5 range patches. Our test data 
set is a random subset X C 7^ of size 15,000. 



A. 4. Optical flow patches. The optical flow database by Roth and Black |2007 | contains 
800 optical flow flelds, each 250 x 200 pixels (Figure 14). This ground-truth optical flow 
is generated by pairing range images and camera motions. The range images are from 
the Brown range image database. The camera motions are retrieved from a database of 
67 videos from hand-held or car-mounted video cameras, each approximately 100 frames 
long. The camera motions are extracted using boujou software by 2d3 Ltd., available at 
http : // www . 2d3 . com. The Roth and Black database is available at http : / / www . gr i s . , 



inf ormat ik . tu-darmstadt . deZ-^sroth/ research/flow/ downloads . html 



From the Black and Roth database we create a space of optical flow patches, and our 

[20031 . 



preprocessing is similar to that of Lee et al. 



[1) We randomly choose 4 • 10^ size 3x3 optical flow patches from the Roth and Black 
database. Each patch is a matrix of ordered pairs, where Ui and Vi are the horizontal 
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Figure 14. (Top) Two sample optical flows from the Roth and Black data- 
base. Horizontal components are on the top and vertical components are on 
the bottom. White corresponds to flow in the positive direction {+x or +y) 
and black corresponds to the negative direction. (Bottom) Vector flelds e^^ 63, 
e^, and from the 3x3 optical flow DCT basis. Compare with the patches 



ei and 62 in Figure 12 



and vertical components, respectively, of the flow vector at pixel i. 

(ui.vi) (^4,^4) (^7,^7) 

{U2,V2) K,^5) (^8,^8) 
_{us,Vs) (uq.Vq) {uq,Vq)_ 

We deflne u = (t^i, . . . , uq)^ and v = {vi^ . . . ^ Vq)^ to be the vectors of horizontal and 
vertical flow components. We rearrange each patch to be a vector 




(2) We compute the contrast norm \\x\\d = yYlir^j ~ for all patches. 

We select the patches with contrast norm in the top 20% of the entire sample. 

(3) We normalize the patches to have zero average flow. More explicitly, given a patch 
X, let G have each entry equal to \ Y^i=i^i^ ^he average of the horizontal 
components. Let v be deflned similarly. We replace each patch x with 




We then divide each patch by its contrast norm. 

18 



(4) We change to the DCT basis {e^, . . . , e^, e^, . . . , e^} C M^^ for 3 x 3 optical flow 
patches, where 



and 



This maps the patches to a 16-dimensional sphere. 

Let T be the resulting set of high-contrast, normalized, 3x3 optical flow patches. Our test 
data set is a random subset X C J-" of size 15,000. 



A. 5. Gene expression data. IWhitfield et al.|f2002] run flve parallel time series experiments 
measuring gene expression levels as HeLa cells proceed through the cell cycle. Their experi- 



ments are available at http: //smd. Stanford. edu/cgi-bin/publication/viewPublicatio a 
|pl?pub_no =106&2 3766] We select a small set of high quality cyclically expressed genes 
through the following steps. Our thresholding is very strict as the purpose of this exercise 
is not to discover new biology but to test our method on time series data. 



(1) 

(2) 
(3) 

(4) 

(5) 
(6) 

(7) 

(8) 



For quality control, we include only the spots in which either 

• channel 1 mean intensity / median background intensity > 1.5, 

• channel 2 mean intensity / median background intensity > 1.5, or 

• the spot regression correlation between the channels > 0.5. 

We consider only genes which appear in all flve time series experiments, which contain 
at least 60% of the measurements in each experiment, and which contain at least 70% 
of the measurements in total. 

We now focus on the experiment "Double Thymidine Block Experiment 3" which 
contains hourly measurements for about 4.3 • 10^ gene elements over 46 hours, with 
two measurements at time zero. 

We KNN impute missing measurements [Troyanskaya et aL 2001| with A; = 10 using 
PAM software. 

We mean center each row (gene) and column (time point). 

We zero-transform the data by subtracting the mean of the two time zero columns 
from all other columns. 

We select the genes with at least three measurements greater than 1.5 in absolute 
value. Only 29 genes remain. 

We cluster the genes using the Pearson correlation centered metric. We select one 
cluster of six genes— CDC2, UBE2C, T0P2A, CCNF, AURKA, and PLKl— which 



are cyclically expressed (Figure 15). These genes are well-known to be part of the 
cell cycle. 

(9) We combine the two measurements at time zero into a single measurement by av- 
eraging. We now have a set of vectors {w^^ . . . ,1(^46} C M^, where Wt contains the 
expression levels of the six genes at hour t. 
(10) We use blocks of flve time points to form our test data set X = {xq, . . . , X42} C M^^, 
where 













/W42\ 
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Figure 15. Microarray data for the six genes CDC2, UBE2C, T0P2A, 
CCNF, AURKA, and PLKl. Each row is a gene and each column is a time 
point. Red corresponds to high expression and green to low expression. Ap- 
proximately three cell cycles are visible. 



Appendix B. Initial bands 

Let p and q be distinct 0-cells. Our method for generating the initial bands between p and 
q depends on whether data set X is a general data set in or whether X is normalized to 
lie on a unit sphere S^~^ C M^. 

For a general data set X C M^, we pick a random vector y from the set of all unit vectors 
perpendicular to p — which is a sphere of dimension n — 2. We also pick r G [0, d{p^ q)] 
uniformly randomly. The resulting initial band is N evenly distributed nodes along the 
circular arc (or straight line, with probability zero) between the points {p + q + ry) and 
q. 

If X is normalized to lie on a unit sphere S^~^ C M^, we generate initial bands lying near 
gn-i^ Though p and q are near dense regions of X they need not lie in X nor in S^~^ . Let 
p = p/\\p\\ and q = q/\\q\\. We pick a random vector y ^ p^q m S^~^. The plane defined 
by j5, and q intersects S^~^ in a circle. Let j5 = i)i, . . . , 'Oat = ^ be the unique band that 
is evenly-spaced along the this intersection circle, that starts at j5, that ends at g, and that 
does not pass through y. We define our initial band to be p = ^'i, . . . , ^'at = Q', where 



[N-i)\\p\\ + [i-l)\\q\ 
N -I 



In general, one may be interested in finding dense 1-cell loops which start and end at the 
same 0-cell. We did not search for loops in our test data sets. See Figure [16] for a sample 
1-cell trial. 
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Figure 16. Sample 1-cell trial from the gene expression data set. The initial 
band is in red and the convergent 1-cell is in blue. The lines which fade from 
yellow to green trace the paths of the intermediate nodes. The gene expression 
data is projected to the plane using PCA. 



Appendix C. Higher-dimensional cells 

One can imagine adapting the NEB method to search for higher-dimensional cells. Given 
an initial fc-cell with A; > 1, one would like to move it towards a maximum density fc-cell. 
We describe a straightforward approach which we use with our test data sets. We model 
a /c-cell as a graph G = (V^ E) with all edges simple. Given a node a ^ V ^ let G 
denote its position and = {/3 G ^ | {o^,/3} G E} denote its set of incident vertices. We 
estimate the fc-dimensional tangent space at to be the span of the first k components of 
a principal component analysis on {vj^ — '^a | /3 G K^}. Specified boundary nodes lie in the 
[k — l)-skeleton of our CW model and remain fixed, but if a is not a boundary node, then 
we define 

to be the force at vertex a. As before, V f{ya)\^ is the component of Vf{va) perpendicular to 
the tangent space and is called the gradient force. Gradient constant c adjusts the strength 
of the force. The term "^^/sevA^f^ ~ '^«) spring force. We numerically solve the system 

of first order diflFerential equations v'^ = F^. 

We point out several weaknesses in the straightforward approach above. 

• The spring force does not generalize the dimension k = 1 case. 

• The gradient forces and spring forces do not both go to zero; instead, they balance 
against one another. This means that the cell does not converge exactly to the 
maximum density cell. A possible remedy is to project the spring force to the tangent 
space. One may then need to add an appropriate smoothing force to prevent kinks 
from forming in the cell. 

• It may be preferable to model a fc-cell not as a graph with forces acting on the nodes 
but as a fc-dimensional simplicial complex with forces acting on simplices. 
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• One may want an adaptive representation of a cell whose triangulation changes as the 
cell moves. Otherwise, the choice of an initial triangulation may affect the subsequent 
motion of the cell. 

We leave the exploration of improved generalizations of NEB in higher dimensions as the 
subject of future work. 

We search for 2-cells in four of our test data sets. We form a web-shaped graph with 20 
nodes on each of 10 concentric rings. Let dV C V denote the boundary nodes which lie in 
the 1-skeleton and remain fixed. We initially place the center node at the average of the 
boundary nodes, and we linearly interpolate between the boundary and center to place the 
other nodes. We set constant c as in the case of and 1-cells, and we say a cell has converged 



when \V \ dV 

miuaev fi"^ a)- See Figure [17] for the resulting 2-cells. 



aev\dv 



is less than 10 . We estimate a convergent cell's density as 







Figure 17. The 2-cells in the test data sets. {Top left) Social network data, 
projected to a plane using PGA. {Top right) Range image patches, projected 
to the 6165 plane. {Bottom left) Optical fiow patches, projected to the 6^62 
plane. {Bottom right) Gene expression data, projected to a plane using PGA. 
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